Week 8 - Discussion Questions

These are example discussion points for you to think about before class. You are not expected to engage with all of them — pick the ones that speak most directly to your own research, and bring two or three rough answers to the in-class session. The full description of how to use these pages, including what the question tags mean, is on the Week 1 Discussion page.

Sub-lessons

What Multimodal AI Can See, Hear, and Read

Calibrate The lesson's core distinction is between “reading” (extracting tokens from non-text input) and “understanding” (working with what the input means). Pick a recent multimodal-AI claim you have seen and locate where the claim sits on that distinction.
Apply Across the four modalities (image, document, audio, video), which is currently the highest-value addition to your own research workflow, and which is most overhyped for what you do?
Critical The model landscape changes monthly. Which of the modality-specific capabilities you depend on today are most likely to be obviously outdated by this time next year — and what habit would let you notice the change?
Connect Week 2 introduced AI image generation as a separate model family; Week 7 introduced AI-assisted data analysis as a separate workflow. This sub-lesson rolls those (and audio and video) into a single multimodal frame. At what point does “multimodal” stop being a list of features inherited from those earlier weeks and start being a single integrated capability the earlier weeks did not anticipate?

AI and Scientific Images

Calibrate The CharXiv finding (47.1% for the best model vs 80.5% for humans on reasoning over real scientific charts) is the headline calibration check for this lesson. Pick a frontier-model release from the last two months and predict how it would do on CharXiv. What would change your prediction?
Apply Pick a real figure from your own field. Ask AI to interpret it. Where it is right for the wrong reason, what does that say about whether you can use it for anything more than a sanity check?
Critical The “correct answer, wrong reasoning” problem is presented as serious for science. Is the right response to (a) avoid the tools entirely for image reasoning, (b) require human verification of the reasoning chain, or (c) accept it and weight outputs accordingly? Defend a choice.
Connect Week 7's “silent error” problem in AI-assisted data analysis is the closest cousin of the “correct answer, wrong reasoning” problem here. Compare the two. Does the same verification habit catch both, or does scientific-image reasoning need a genuinely different kind of check?

Transcription and Audio Analysis

Calibrate Run a transcription on a recording in your most-used language and one in a language with less training-data coverage. Use the lesson's framing of “hallucination in transcription” to compare what each gets wrong. Where do the errors cluster?
Apply For a qualitative researcher in your field, what is the minimum transcription quality that you would treat as “good enough to code without re-listening,” and how much review time would you still need to spend at that minimum?
Critical The lesson covers South African and African-language transcription explicitly. If reliable transcription in your target languages is not yet available, is that a reason to delay research that depends on it, or to do the research differently?
Connect Week 2 covered training-data scale and curation as a determinant of what models can and cannot do; Week 4 covered consent and disclosure as ethical obligations for researchers. Both bear directly on transcription in under-supported languages. Pick a fieldwork scenario you might actually face and walk through how those two earlier weeks would shape your decision about whether to use AI transcription at all.
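
If you want a number to go with the Calibrate comparison in this section, word error rate (WER) is one common starting point: the word-level edit distance between a transcript you corrected by hand and the AI transcript, divided by the length of the hand-corrected one. The sketch below is a minimal illustration, not a method prescribed by the lesson. The function name is hypothetical, and it assumes whitespace-separated words and ignores punctuation handling, so treat its output as a rough guide.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the AI transcript
                curr[j - 1] + 1,     # extra word in the AI transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A single score will not tell you where the errors cluster. Words the tool invented count the same as words it misheard, so you still need to read the two transcripts side by side.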

Video and Multimodal Workflows

Calibrate The lesson notes that current Pro-tier models can process roughly an hour of video in a single call. Take that capability as given and identify the research use-case in your field where it would matter most. What would still be left undone?
Apply Sketch a combined-modality workflow (e.g. video + transcript + figures) for a single research task. What does the combination unlock that the individual modalities don't? Where does the combination add new failure modes?
Critical “Native multimodal vs text-centric architectures” is presented as an important architectural distinction. Is it actually load-bearing for what you would notice as a user, or is it more an inside-the-lab question?
Connect Week 2's LLM architecture lesson covered context-window dynamics and positional encoding as the things that determine what a model can keep track of over long inputs. Long-form video processing pushes those mechanisms harder than almost any text use-case. Predict, from the architectural picture in Week 2, where you would expect long-form video processing to fail first — and check the prediction against what the lesson actually says.

Document Intelligence — PDFs, Tables, and Forms

Calibrate Pick a multi-page PDF you have wrestled with. Hand it to a current document-intelligence tool. Where does the tool succeed beyond your expectations, and where does it fail in ways you couldn't have predicted?
Apply For the kind of supplementary data table you encounter most in your field, sketch a minimal extraction protocol that combines an AI tool with a human spot-check. What are you willing to delegate, and what stays in your hands?
Critical Complex tables are described as “the hardest unsolved problem” in document intelligence. Is that fairly described as unsolved, or just under-emphasised? What would “solved” even look like?
Connect The “lost in the middle” problem is the long-context-attention problem from Week 2 in its most practical form. How does seeing it in document intelligence change your reading of the long-context capability claims you encounter elsewhere?
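
For the Apply question in this section, one possible shape for the human spot-check is a reproducible random sample of table cells that you compare against the source PDF by hand. The sketch below is an assumption about how you might do that, not a protocol from the lesson. The function name, the 10% default, and the fixed seed are all illustrative choices.

```python
import random


def pick_spot_check_cells(n_rows: int, n_cols: int,
                          fraction: float = 0.1, seed: int = 0):
    """Return a reproducible random sample of (row, col) cells to verify by hand."""
    if n_rows < 1 or n_cols < 1:
        raise ValueError("table must have at least one row and one column")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    # Always check at least one cell, even for very small tables.
    sample_size = max(1, round(fraction * len(cells)))
    rng = random.Random(seed)  # fixed seed so a colleague can repeat the same check
    return sorted(rng.sample(cells, sample_size))
```

Uniform sampling is the simplest choice. If the extraction tends to fail in specific places (merged headers, footnoted cells), you may prefer to check those cells deliberately as well.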

Hands-On Activities and Assessment

Calibrate Compare cohort outputs on Activity 1 (Figure Analysis). Where two researchers got different interpretations from the same AI tool on the same figure, what does the divergence say about the noise floor of multimodal reasoning?
Apply Build a personal “multimodal verification checklist” specific to your research modality of choice. Make it short enough that you will actually use it on a real task.
Critical The Transcription and Verification activity rewards the cohort for spotting transcription errors. Is your error-spotting rate likely to be the same when the transcript is for a study you are emotionally invested in?
Connect Compare the verification habits demanded by these activities with the verification protocols introduced in Week 5 (literature), Week 6 (writing), and Week 7 (data analysis). Of the three earlier protocols, which is closest to the kind of careful reading the multimodal activities demand, and which is furthest? What does that say about how transferable “AI verification” really is across modalities?